A Corpus-Based Study of Phoneme Distribution in Thai
Abstract
This paper presents the steps for deriving Thai phoneme distributions from large-scale written Thai corpora. The data came from 12 text genres in InterBEST [1], considered the largest collection of Thai corpora. Each word was transliterated into phonemes using grapheme-to-phoneme software [2]. Word frequencies, the frequencies of 81 Thai phonemes in each genre, and the 95% confidence intervals (CIs) of the average occurrence of each phoneme were then calculated. For each genre, the phonemes whose frequencies fell outside the 95% CI were counted. As a result, 3 genres whose distributions were highly incompatible with the others were removed, leaving approximately 80% of the data. Finally, we obtained the phoneme distributions of initials, finals, vowels, and tones. Importantly, 4 bigram frequencies (final-to-tone, vowel-to-final, initial-to-tone, and initial-to-vowel) and a trigram frequency (vowel-final-tone) are also given.
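The genre-filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the 95% CI was computed, so the normal-approximation interval (z = 1.96) over per-genre relative frequencies is an assumption, and the function names, genre names, and toy phoneme data are all hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def relative_frequencies(phonemes):
    """Map each phoneme to its share of all phoneme tokens in one genre."""
    counts = Counter(phonemes)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}


def confidence_interval(values, z=1.96):
    """95% CI of the mean across genres (normal approximation; an assumption)."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m - half_width, m + half_width


def count_outliers(genre_freqs):
    """For each genre, count the phonemes whose frequency lies outside the CI."""
    phonemes = sorted({p for freqs in genre_freqs.values() for p in freqs})
    outliers = {genre: 0 for genre in genre_freqs}
    for p in phonemes:
        values = [freqs.get(p, 0.0) for freqs in genre_freqs.values()]
        low, high = confidence_interval(values)
        for genre, freqs in genre_freqs.items():
            if not low <= freqs.get(p, 0.0) <= high:
                outliers[genre] += 1
    return outliers


# Toy data: hypothetical genres with two placeholder phonemes "a" and "b".
toy_genres = {
    "news": ["a", "b"] * 5,
    "novel": ["a", "b"] * 5,
    "law": ["a", "b"] * 5,
    "chat": ["a"] * 8 + ["b"] * 2,
}
toy_freqs = {g: relative_frequencies(seq) for g, seq in toy_genres.items()}
print(count_outliers(toy_freqs))
```

In this toy run, only the hypothetical "chat" genre has phonemes outside the interval, so it would be the candidate for removal. With only 12 genres, a Student's t interval might be a more defensible choice than z = 1.96; the sketch keeps the simpler form for readability.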
Related articles
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM
Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research achievements show that using deep neural network (DNN) in speech recognition systems significantly improves the performance of these systems. There are two phases in DNN-based phoneme recognition systems including training and testing. Mos...
Example-based grapheme-to-phoneme conversion for Thai
Several characteristics of the Thai writing system make Thai grapheme-to-phoneme (G2P) conversion very challenging. In this paper, we propose an Example-Based Grapheme-to-Phoneme conversion approach. It generates the pronunciation of a word by selecting, modifying and combining pronunciations of syllables from the training corpus. The best system achieves 80.99% word accuracy and 94.19% phone accu...
Frequency of occurrence of phonemes and syllables in Thai: Analysis of spoken and written corpora
This work provides detailed frequency and distribution of Thai phonemes, biphones, and syllable types drawn from three large-scale Thai corpora (InterBEST, LOTUS-BN, and LOTUS-Cell 2.0). Comparisons are carried out to examine the extent to which linguistic variation, associated with different corpus types (written vs. spoken), affects frequency statistics and distribution patterns. Results and s...